users@jpa-spec.java.net

[jpa-spec users] [jsr338-experts] Re: support for multitenancy

From: michael keith <michael.keith_at_oracle.com>
Date: Wed, 28 Mar 2012 10:12:23 -0400

On 27/03/2012 6:15 PM, Linda DeMichiel wrote:
> Hi Mike,
>
> On 3/27/2012 11:47 AM, michael keith wrote:
>> Hi Linda,
>>
>> Thanks for writing all this up.
>>
>> Some comments inline.
>>
>> -Mike
>>
>> On 26/03/2012 7:54 PM, Linda DeMichiel wrote:
>>> One of the main items on the agenda for the JPA 2.1 release is support
>>> for multitenancy in Java EE 7 cloud environments.
>>>
>>> In Java EE 7, an application can be submitted into a cloud environment
>>> for use by multiple tenants in what can be viewed as a basic form of
>>> software as a service (SaaS). The application is customized and
>>> deployed on a per-tenant basis. At runtime, there is a separate
>>> application instance (or set of instances, e.g., in a clustered
>>> environment) per tenant. The instances used by different tenants are
>>> isolated from one another. The resources used by a tenant's
>>> application may also be isolated from one another, or may be shared.
>>> In general, however, it is assumed that a tenant's data is isolated
>>> from that of other tenants.
>>
>> That seems like the right default assumption to make. A config option
>> made to/by the resource consumer (in our case the JPA provider) would
>> override that assumption, I suppose?
>>
>
> Whether a database was used by multiple tenants (with isolated data)
> would be determined by the tenant (e.g., an SLA might stipulate a
> separate database, assuming that option were available, or the tenant's
> deployer might) or would be determined by the cloud platform provider.
> I don't see the JPA provider overriding, at least in the Java EE 7 case,
> but maybe I'm missing what you intended.
>

Yes, I didn't mean that the JPA provider would actually do the overriding,
but that some config option passed to it might dictate the strategy. I
agree that the tenant, deployer, or cloud provider would determine that.

>>> There are three well-known approaches to support for multitenancy at
>>> the database level:
>>>
>>> (1) separate database approach
>>> (2) shared database / separate schema approach
>>> (3) shared schema / shared table approach
>>>
>>>
>>> To get the discussion started, this is a high-level strawman sketch of
>>> how the 3 approaches might be used with JPA in keeping with the Java
>>> EE 7 approach. At the same time, however, we also want to be sure
>>> that what we specify in JPA 2.1 can be extended to encompass a more
>>> general approach to SaaS in the future in which a single application
>>> instance serves multiple tenants and in which multitenancy is managed
>>> by the Java EE environment.
>>>
>>> For further information on how Java EE 7 is approaching PaaS/SaaS, you
>>> might find the documents on the javaee-spec.java.net project useful,
>>> particularly
>>> http://java.net/projects/javaee-spec/downloads/download/PaaS.pdf
>>> and the latest draft of the Java EE 7 Platform spec,
>>> http://java.net/projects/javaee-spec/downloads/download/JavaEE_Platform_Spec.pdf.
>>>
>>>
>>> Note that the identifier for the tenant will be made available to the
>>> application in JNDI as java:comp/tenantId. The tenantId will be a
>>> string, whose max length should allow it to be portably stored in a
>>> single database column.
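
[For reference, reading the tenant identifier would presumably be a plain
JNDI lookup; a minimal sketch, assuming only the java:comp/tenantId name
proposed above:

    import javax.naming.InitialContext;
    import javax.naming.NamingException;

    public class TenantIdLookup {
        // Reads the tenant identifier the container publishes at
        // java:comp/tenantId; returns null if the name is not bound
        // (e.g., outside a multitenant deployment).
        public static String currentTenantId() {
            try {
                return (String) new InitialContext().lookup("java:comp/tenantId");
            } catch (NamingException e) {
                return null;
            }
        }
    }
]
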
>>>
>>>
>>> APPROACHES:
>>>
>>> (1) Separate database approach
>>>
>>> In this approach, each tenant's persistence unit is mapped to a
>>> separate database. This approach provides the greatest isolation
>>> between tenants and does not impose any additional constraints over
>>> the object/relational mapping of the persistence unit or over the
>>> operations that can be performed. In particular, the use of
>>> multiple database schemas or catalogs is supported, as are native
>>> queries.
>>>
>>> In some cloud environments, use of this approach might not be
>>> available, as a tenant might be allocated storage within a database
>>> rather than a separate database.
>>
>> So this is basically what JPA assumes today.
>>
>
> Right
>
>>> (2) Shared database / separate schema approach
>>>
>>> In this approach, each tenant's data is stored in database tables
>>> that are isolated from those of any other tenant. In databases that
>>> support schemas, this will typically be achieved by allocating a
>>> separate schema per tenant. The database's permissions facility is
>>> used to confine a tenant's access to the designated schema, thus
>>> providing isolation between tenants at the schema level.
>>>
>>> Support for this approach is straightforward if the persistence unit
>>> uses only the default schema or catalog (i.e., if it does not specify
>>> schema names or catalogs in the object/relational mapping metadata).
>>> A native query that attempts to access data in a schema other than
>>> that assigned to the tenant by the platform provider will be trapped
>>> by the database authorization mechanisms and will result in an
>>> exception.
>>>
>>> [While the case where the persistence unit metadata explicitly
>>> specifies one or more schemas could potentially be handled by the
>>> persistence provider by remapping schema names and native queries that
>>> embed schema names, I would not propose that we specify or require
>>> support for this case, although a more sophisticated persistence
>>> provider might choose to support it.]
>>
>> So, in summary, portable apps may not specify a schema or catalog at
>> any level:
>> mapping (annotation or XML), mapping file, persistence unit default,
>> or in a native query.
>>
>
> Right
>
>>>
>>> (3) Shared table approach
>>>
>>> In this approach, database tables are shared ("striped") across
>>> tenants.
>>>
>>> It is the responsibility of the persistence provider to provide
>>> per-tenant isolation in accessing data. This will typically be done
>>> by mapping and maintaining a tenant ID column in the respective
>>> tables, and augmenting data retrieval and query operations, updates,
>>> and inserts with tenant IDs. The use of native queries would need to
>>> be trapped by the persistence provider and not allowed unless the
>>> persistence provider were able to modify them to provide isolation of
>>> tenant data.
>>
>> So, portable applications could not use either schemas or native
>> queries in this mode, and there will be an opportunity for the
>> application to be able to map the tenant id column in each table.
>>
>
> Right
>
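
[A minimal sketch of what the striped mapping might boil down to, using
only existing JPA mappings -- the TENANT_ID column name and the idea that
the provider maintains and filters on it are assumptions, not anything the
spec defines yet:

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;

    @Entity
    public class Account {
        @Id
        private Long id;

        // Column shared by all tenants' rows in the ACCOUNT table. Under
        // the shared table approach the provider would set this on insert
        // and add "WHERE TENANT_ID = ?" to the SQL it generates for this
        // entity.
        @Column(name = "TENANT_ID", insertable = false, updatable = false)
        private String tenantId;

        private String name;
    }
]
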
>>>
>>> Ideally, the management of the tenant id should be transparent to the
>>> application, although we should revisit this in Java EE 8 as we move
>>> further into support for SaaS.
>>
>> For the application to not have to manage tenant ids, I guess the
>> tenant identifier would need to be available to the provider on a
>> per-invocation basis (in a thread context set by the container)? As
>> you mention, not something that we necessarily have to worry about
>> now, but just so we know what we will need in the future if this is
>> what we want.
>>
>
> Right. For shared application instances we will need this.
>
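
[Purely to illustrate the kind of per-invocation, container-set thread
context being described above -- every name in this sketch is
hypothetical, nothing like it exists in the spec today:

    // Hypothetical holder the container would populate before dispatching
    // a request on behalf of a tenant and clear afterwards; the
    // persistence provider would consult it when generating SQL or
    // choosing a data source.
    public final class CurrentTenant {
        private static final ThreadLocal<String> TENANT_ID = new ThreadLocal<String>();

        public static void set(String tenantId) { TENANT_ID.set(tenantId); }

        public static String get() { return TENANT_ID.get(); }

        public static void clear() { TENANT_ID.remove(); }

        private CurrentTenant() { }
    }
]
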
>>> I believe that the main use case for the shared table approach is in
>>> SaaS environments in which a single application instance is servicing
>>> multiple tenants. This is outside the scope of Java EE 7, so I don't
>>> think that we need to standardize on support for this approach now,
>>> although we should not lose sight of it as we standardize on other
>>> aspects.
>>
>> Yes, there is some value in this being available today, though, given
>> that some people are doing multitenancy in their own environment,
>> outside the cloud. I guess it just depends how far we want to go to
>> enable SaaS in JPA in this round.
>>
>
> Right. This would be "application-managed SaaS" in my terminology :-)
> I'm not opposed to considering it in this release, although it depends on
> time constraints. But it might be advantageous to get more vendor
> experience first before we standardize.

I like the term application-managed SaaS.
Best way to describe what we are talking about.

>
>>> DETERMINING THE MULTITENANCY STORAGE MAPPING STRATEGY:
>>>
>>> We see two general approaches to determining the multitenancy storage
>>> mapping strategy that should be used for a persistence unit. In some
>>> cases, these approaches might be combined.
>>>
>>> Again, note that a cloud platform provider might use a single strategy
>>> for all tenants in allocating database storage. For example, each
>>> tenant might be allocated a separate database, or each tenant might
>>> only be allocated a schema within a database.
>>>
>>>
>>> (A) The Application Specifies Its Requirements
>>>
>>> In this approach, the application specifies its functional
>>> requirements (in terms of need for named, multiple schemas and/or use
>>> of native queries) in the persistence.xml descriptor, and the deployer
>>> and/or cloud platform provider determine the storage strategy that is
>>> used for the tenant. This metadata serves as input to the deployer
>>> for the tenant or as input into the automated provisioning of the
>>> application by the cloud platform provider (if automated provisioning
>>> is supported by the platform instance).
>>>
>>> For example, an application might specify that it requires support for
>>> multiple schemas and native queries. In general, such requirements
>>> would mean that a separate database would need to be provisioned for
>>> the tenant. If this is not possible, then unless the platform
>>> provider supported a persistence provider that could perform schema
>>> remapping and/or modification of native queries, the application might
>>> fail to deploy or fail to initialize. On the other hand, if an
>>> application specifies that it uses only the default schema and native
>>> queries, then either the separate database or separate schema approach
>>> could be used.
>>
>> I'm less enamored with this approach.
>> Although many cloud platforms are going to support both an internally
>> hosted DBaaS as well as access to an external DB, my guess is that they
>> won't have multiple different ways of implementing their internally
>> hosted database services (e.g. one as a separate DB and one with striped
>> data). I could be wrong, but realistically I don't think a cloud provider
>> is ever going to implement a db service using striping. As was mentioned
>> above, a SaaS application might decide to use its database that way.
>
> I don't think a cloud provider would do this either.
>
>> Basically, the restriction that schemas not be used in portable cloud
>> apps is enough, I think, for cloud applications. Any additional
>> requirements or relaxations are cloud specific.
>>
>>> (B) The Application Specifies the Multitenancy Storage Mapping Strategy
>>>
>>> An alternative approach is that the application specifies the required
>>> (or preferred) multitenancy storage mapping strategy in the
>>> persistence.xml.
>>
>> This is a preferable approach, and even though it may not be
>> *necessary* for cloud deployment, it would be nice to have these
>> options so the provider can do some checking at deployment time rather
>> than the app failing at runtime. It would also provide a standard way
>> of configuring for striping in SaaS apps.
>>
>>> For example, a multitenant application that is designed with the
>>> intention that separate databases be used might indicate this in the
>>> persistence.xml as multitenancy = SEPARATE_DATABASE.
>>
>> In general I don't think they would even need to specify this, since
>> this is what we already assume, isn't it?
>>
> But in a cloud environment that may not be the default storage option,
> so this would indicate that the application was written to *require*
> use of that strategy.

What does the absence of the property imply? That the application would
work with any storage option, or that the behavior is undefined? I figured
that, for the sake of backward compatibility, existing applications that
did not specify this property (but, for example, used a schema) would be
assumed to require a separate database, but I guess this property would
need to be set in order to deploy such an app into a cloud?

>>> An application that is designed with the intention that databases may
>>> be shared by partitioning at the database schema level might indicate
>>> this in the persistence.xml as multitenancy = SHARED_DATABASE. [A
>>> portable application that specifies this strategy should not specify
>>> schema or catalog names, as it might otherwise fail to deploy or fail
>>> to initialize.]
>>
>> This probably doesn't matter, but although I find the terminology easy
>> to understand, from a PaaS user perspective the line between 1 and 2
>> might be a little fuzzy, because most of the cloud providers have some
>> kind of "database service" but the capabilities of those services
>> differ. In some cases one can create db instances and schemas (SEPARATE
>> DB), while in other cases the tenant "database" is just a place to
>> store data, with a default schema and no ability to create a new one
>> (SHARED DB).
>>
>>> An application that is designed with the intention that tables be
>>> shared might indicate this in the persistence.xml as multitenancy =
>>> SHARED_SCHEMA. An app that uses explicit multitenant mapping metadata
>>> would be expected to specify this.
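
[Just to make the shape of this concrete -- the property name and the
"inventoryPU" unit name below are placeholders for whatever we end up
standardizing, not an existing API; the strategy values are the ones
proposed above:

    import java.util.HashMap;
    import java.util.Map;
    import javax.persistence.EntityManagerFactory;
    import javax.persistence.Persistence;

    public class MultitenancyStrategyExample {
        public static EntityManagerFactory create() {
            Map<String, String> props = new HashMap<String, String>();
            // Hypothetical property name; the equivalent element would
            // normally live in the persistence.xml, with one of the
            // proposed values SEPARATE_DATABASE, SHARED_DATABASE, or
            // SHARED_SCHEMA.
            props.put("javax.persistence.multitenancy", "SHARED_SCHEMA");
            return Persistence.createEntityManagerFactory("inventoryPU", props);
        }
    }
]
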
>>>
>>> [Open Issue: Is it useful to specify requirements along the lines of
>>> those used in approach (A) with this approach? If so, is the platform
>>> provider allowed to choose a different mapping strategy as long as
>>> that approach is more isolated? If no functional requirements are
>>> specified as in approach (A), and a mapping strategy is specified in
>>> the persistence.xml provided by the application submitter, then the
>>> risk is that the app will fail if that information is not observed.
>>> For example, observation of the specified mapping strategy
>>> might be required for the case where explicit multitenant mapping
>>> metadata is supplied for the striped mapping approach.]
>>>
>>>
>>> With both the approaches (A) and (B), different storage mapping
>>> strategies may be used for different tenants of the same application
>>> if the cloud platform provider supports a range of storage mapping
>>> choices.
>>>
>>>
>>> REQUIREMENTS FOR PORTABLE APPLICATIONS
>>>
>>> Applications that are intended to be portable in cloud environments
>>> should not specify schema or catalog names.
>>
>> This sounds very reasonable to me and solves 99% of the cloud JPA app
>> scenario.
>>
>>> DEPLOYMENT
>>>
>>> When an application instance is deployed for a tenant, the container
>>> needs to make the tenant identifier and tenant-related configuration
>>> information available to the persistence provider. The container
>>> needs to pass to the persistence provider a data source that is
>>> configured with appropriate credentials for the tenant, and which will
>>> provide isolation between that tenant and other tenants of the
>>> application. We should probably also define an interface to capture
>>> the tenant identifier and tenant-related metadata and configuration
>>> information that the container needs to pass to the persistence
>>> provider, e.g., a TenantContext.
>>
>> Again, this would definitely help to enable JPA in SaaS apps.
>>
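
[Something along these lines, perhaps -- the method names are guesses at
what such a TenantContext might carry, nothing here is specified:

    import java.util.Map;
    import javax.sql.DataSource;

    // Hypothetical container-provided view of a tenant, handed to the
    // persistence provider at deployment: the tenant identifier, the
    // tenant's isolated data source, and any tenant-specific
    // configuration.
    public interface TenantContext {
        String getTenantId();

        DataSource getDataSource();

        Map<String, String> getProperties();
    }
]
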
>>> OTHER OPEN ISSUES
>>>
>>> 1. Additional metadata to support schema generation.
>>
>> We might want to rename this to what it actually does -- table
>> generation :-)
>>
>
> That would certainly be more precise for the separate schema case :-)
> But if the application instance "owned" the database, the persistence
> provider might be creating schemas.
>
>>> 2. Do we need metadata to indicate whether an application supports
>>> multitenant use -- i.e., whether it is "multitenant enabled"?
>>> Do we need this information specifically for JPA?
>>
>> Again, it is not strictly required for PaaS, but it would be really
>> nice to have it so SaaS could be enabled, even though it is not
>> formally supported.
>>
>>> 3. Specification of resources that are shared across tenants--e.g.,
>>> a persistence unit for reference data that can be accessed by
>>> multiple tenants.
>>
>> I'm not sure we need to solve this problem at this stage. Multiple
>> tenants accessing a shared read-only resource through identical JPA
>> configurations is one thing, but having a single shared persistence
>> unit spanning multiple applications seems out of scope.
>
> I was assuming the former -- i.e., the tenants had identical persistence
> unit configurations for the resource, but no sharing at the EMF level --
> i.e., a separate EMF per application instance. Things certainly do get
> more interesting when we move into the SaaS case where application
> instances are multitenant :-)