[jpa-spec users] [jsr338-experts] Re: support for multitenancy

From: Linda DeMichiel <linda.demichiel_at_oracle.com>
Date: Tue, 27 Mar 2012 15:15:14 -0700

Hi Mike,

On 3/27/2012 11:47 AM, michael keith wrote:
> Hi Linda,
>
> Thanks for writing all this up.
>
> Some comments inline.
>
> -Mike
>
> On 26/03/2012 7:54 PM, Linda DeMichiel wrote:
>> One of the main items on the agenda for the JPA 2.1 release is support
>> for multitenancy in Java EE 7 cloud environments.
>>
>> In Java EE 7, an application can be submitted into a cloud environment
>> for use by multiple tenants in what can be viewed as a basic form of
>> software as a service (SaaS). The application is customized and
>> deployed on a per-tenant basis. At runtime, there is a separate
>> application instance (or set of instances, e.g., in a clustered
>> environment) per tenant. The instances used by different tenants are
>> isolated from one another. The resources used by a tenant's
>> application may also be isolated from one another, or may be shared.
>> In general, however, it is assumed that a tenant's data is isolated
>> from that of other tenants.
>
> That seems like the right default assumption to make. A config option
> made to/by the resource consumer (in our case the JPA provider) would
> override that assumption, I suppose?
>

Whether a database was used by multiple tenants (with isolated data) would
be determined by the tenant (e.g., an SLA might stipulate a separate database,
assuming that option were available, or the tenant's deployer might) or would
be determined by the cloud platform provider. I don't see the JPA provider
overriding this, at least in the Java EE 7 case, but maybe I'm missing what you intended.

>> There are three well-known approaches to support for multitenancy at
>> the database level:
>>
>> (1) separate database approach
>> (2) shared database / separate schema approach
>> (3) shared schema / shared table approach
>>
>>
>> To get the discussion started, this is a high-level strawman sketch of
>> how the 3 approaches might be used with JPA in keeping with the Java
>> EE 7 approach. At the same time, however, we also want to be sure
>> that what we specify in JPA 2.1 can be extended to encompass a more
>> general approach to SaaS in the future in which a single application
>> instance serves multiple tenants and in which multitenancy is managed
>> by the Java EE environment.
>>
>> For further information on how Java EE 7 is approaching PaaS/SaaS, you
>> might find the documents on the javaee-spec.java.net project useful,
>> particularly http://java.net/projects/javaee-spec/downloads/download/PaaS.pdf
>> and the latest draft of the Java EE 7 Platform spec,
>> http://java.net/projects/javaee-spec/downloads/download/JavaEE_Platform_Spec.pdf.
>>
>> Note that the identifier for the tenant will be made available to the
>> application in JNDI as java:comp/tenantId. The tenantId will be a
>> string, whose max length should allow it to be portably stored in a
>> single database column.
>>
>>
>> APPROACHES:
>>
>> (1) Separate database approach
>>
>> In this approach, each tenant's persistence unit is mapped to a
>> separate database. This approach provides the greatest isolation
>> between tenants and does not impose any additional constraints over
>> the object/relational mapping of the persistence unit or over the
>> operations that can be performed. In particular, the use of
>> multiple database schemas or catalogs is supported, as are native
>> queries.
>>
>> In some cloud environments, use of this approach might not be
>> available, as a tenant might be allocated storage within a database
>> rather than a separate database.
>
> So this is basically what JPA assumes today.
>

Right

>> (2) Shared database / separate schema approach
>>
>> In this approach, each tenant's data is stored in database tables
>> that are isolated from those of any other tenant. In databases that
>> support schemas, this will typically be achieved by allocating a
>> separate schema per tenant. The database's permissions facility is
>> used to confine a tenant's access to the designated schema, thus
>> providing isolation between tenants at the schema level.
>>
>> Support for this approach is straightforward if the persistence unit
>> uses only the default schema or catalog (i.e., if it does not specify
>> schema names or catalogs in the object/relational mapping metadata).
>> A native query that attempts to access data in a schema other than
>> that assigned to the tenant by the platform provider will be trapped
>> by the database authorization mechanisms and will result in an
>> exception.
>>
>> [While the case where the persistence unit metadata explicitly
>> specifies one or more schemas could potentially be handled by the
>> persistence provider by remapping schema names and native queries that
>> embed schema names, I would not propose that we specify or require
>> support for this case, although a more sophisticated persistence
>> provider might choose to support it.]
>
> So, in summary, portable apps may not specify a schema or catalog at any level:
> mapping (annotation or XML), mapping file, persistence unit default, or in a native query.
>

Right

>>
>> (3) Shared table approach
>>
>> In this approach, database tables are shared ("striped") across tenants.
>>
>> It is the responsibility of the persistence provider to provide
>> per-tenant isolation in accessing data. This will typically be done
>> by mapping and maintaining a tenant ID column in the respective
>> tables, and augmenting data retrieval and query operations, updates,
>> and inserts with tenant IDs. The use of native queries would need to
>> be trapped by the persistence provider and not allowed unless the
>> persistence provider were able to modify them to provide isolation of
>> tenant data.
>
> So, portable applications could not use either schemas or native queries in
> this mode, and there will be an opportunity for the application to be able
> to map the tenant id column in each table.
>

Right
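
As a rough illustration of the striping described in approach (3), the sketch below uses plain Java collections in place of a real database and a real persistence provider. The names Row, SharedTable, and TenantView are hypothetical and come from no JPA API; the point is only that every insert is stamped with the tenant ID and every read is filtered by it.

```java
import java.util.ArrayList;
import java.util.List;

/** One row of a shared ("striped") table; tenantId stands in for the tenant ID column. */
final class Row {
    final String tenantId;
    final String payload;

    Row(String tenantId, String payload) {
        this.tenantId = tenantId;
        this.payload = payload;
    }
}

/** Stand-in for a single table whose rows belong to many tenants. */
final class SharedTable {
    final List<Row> rows = new ArrayList<>();
}

/** Tenant-bound view of the shared table (hypothetical, for illustration only). */
final class TenantView {
    private final SharedTable table;
    private final String tenantId;

    TenantView(SharedTable table, String tenantId) {
        this.table = table;
        this.tenantId = tenantId;
    }

    /** Inserts are augmented with the tenant ID of this view. */
    void insert(String payload) {
        table.rows.add(new Row(tenantId, payload));
    }

    /** Reads return only the rows stamped with this view's tenant ID. */
    List<String> findAll() {
        List<String> result = new ArrayList<>();
        for (Row row : table.rows) {
            if (row.tenantId.equals(tenantId)) {
                result.add(row.payload);
            }
        }
        return result;
    }
}
```

A native query would bypass a filter of this kind entirely, which is why the proposal has the provider trap native queries unless it can rewrite them.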

>>
>> Ideally, the management of the tenant id should be transparent to the
>> application, although we should revisit this in Java EE 8 as we move
>> further into support for SaaS.
>
> For the application to not have to manage tenant ids, I guess the tenant identifier
> would need to be available to the provider on a per-invocation basis (in a thread
> context set by the container)? As you mention, not something that we
> necessarily have to worry about now, but just so we know what we will need
> in the future if this is what we want.
>

Right. For shared application instances we will need this.
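
One way a container might expose the tenant identifier per invocation is a thread-bound holder. The sketch below is an assumption for discussion, not a proposed API: TenantContextHolder, callAs, and currentTenantId are made-up names. The container binds the tenant for the duration of one call, and the provider reads it back.

```java
import java.util.function.Supplier;

/** Hypothetical thread-bound tenant holder; not part of any Java EE or JPA API. */
final class TenantContextHolder {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    private TenantContextHolder() {
    }

    /** Container side: run one invocation with tenantId bound, then restore the previous state. */
    static <T> T callAs(String tenantId, Supplier<T> work) {
        String previous = CURRENT.get();
        CURRENT.set(tenantId);
        try {
            return work.get();
        } finally {
            if (previous == null) {
                CURRENT.remove();
            } else {
                CURRENT.set(previous);
            }
        }
    }

    /** Provider side: read the tenant bound to the current thread, failing if none is bound. */
    static String currentTenantId() {
        String id = CURRENT.get();
        if (id == null) {
            throw new IllegalStateException("no tenant bound to this thread");
        }
        return id;
    }
}
```

Restoring the previous value in the finally block keeps nested invocations and pooled threads from leaking one tenant's identity into another's call.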

>> I believe that the main use case for the shared table approach is in
>> SaaS environments in which a single application instance is servicing
>> multiple tenants. This is outside the scope of Java EE 7, so I don't
>> think that we need to standardize on support for this approach now,
>> although we should not lose sight of it as we standardize on other
>> aspects.
>
> Yes, there is some value in this being available today, though, given that
> some people are doing multitenancy in their own environment, outside the
> cloud. I guess it just depends how far we want to go to enable SaaS in JPA
> in this round.
>

Right. This would be "application-managed SaaS" in my terminology :-)
I'm not opposed to considering it in this release, although it depends on
time constraints. But it might be advantageous to get more vendor experience
first before we standardize.

>> DETERMINING THE MULTITENANCY STORAGE MAPPING STRATEGY:
>>
>> We see two general approaches to determining the multitenancy storage
>> mapping strategy that should be used for a persistence unit. In some
>> cases, these approaches might be combined.
>>
>> Again, note that a cloud platform provider might use a single strategy
>> for all tenants in allocating database storage. For example, each
>> tenant might be allocated a separate database, or each tenant might
>> only be allocated a schema within a database.
>>
>>
>> (A) The Application Specifies Its Requirements
>>
>> In this approach, the application specifies its functional
>> requirements (in terms of need for named, multiple schemas and/or use
>> of native queries) in the persistence.xml descriptor, and the deployer
>> and/or cloud platform provider determine the storage strategy that is
>> used for the tenant. This metadata serves as input to the deployer
>> for the tenant or as input into the automated provisioning of the
>> application by the cloud platform provider (if automated provisioning
>> is supported by the platform instance).
>>
>> For example, an application might specify that it requires support for
>> multiple schemas and native queries. In general, such requirements
>> would mean that a separate database would need to be provisioned for
>> the tenant. If this is not possible, then unless the platform
>> provider supported a persistence provider that could perform schema
>> remapping and/or modification of native queries, the application might
>> fail to deploy or fail to initialize. On the other hand, if an
>> application specifies that it uses only the default schema and native
>> queries, then either the separate database or separate schema approach
>> could be used.
>
> I'm less enamored with this approach.
> Although many cloud platforms are going to support both an internally
> hosted DBaaS as well as access to an external DB, my guess is that they won't
> have multiple different ways of implementing their internally hosted database
> services (e.g. one as a separate DB and one with striped data). I could be wrong,
> but realistically I don't think a cloud provider is ever going to implement a db
> service using striping. As was mentioned above, a SaaS application might decide
> to use its database that way.

I don't think a cloud provider would do this either.
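
For reference, the requirement-to-strategy reasoning in approach (A) can be sketched as a small decision function. This is only an illustration: StrategySelector and permitted are hypothetical names, the enum constants borrow the values floated for persistence.xml in this thread, and the sketch assumes a persistence provider that does no schema remapping and no rewriting of native queries.

```java
import java.util.EnumSet;
import java.util.Set;

/** Storage strategies discussed in the thread; the constants are strawman names, not a spec. */
enum StorageStrategy { SEPARATE_DATABASE, SHARED_DATABASE, SHARED_SCHEMA }

/** Hypothetical helper mapping an application's stated requirements to usable strategies. */
final class StrategySelector {
    private StrategySelector() {
    }

    static Set<StorageStrategy> permitted(boolean usesNamedSchemas, boolean usesNativeQueries) {
        // A separate database imposes no constraints on mappings or queries.
        Set<StorageStrategy> result = EnumSet.of(StorageStrategy.SEPARATE_DATABASE);
        if (!usesNamedSchemas) {
            // Default schema only: a per-tenant schema works, and the database's
            // own permissions confine native queries to that schema.
            result.add(StorageStrategy.SHARED_DATABASE);
            if (!usesNativeQueries) {
                // Shared tables also require that no native query bypass the tenant filter.
                result.add(StorageStrategy.SHARED_SCHEMA);
            }
        }
        return result;
    }
}
```

If the platform offers none of the permitted strategies, the application would be expected to fail at deployment or initialization, as described above.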

> Basically, the restriction that schemas not be used in portable cloud apps is
> enough, I think, for cloud applications. Any additional requirements or relaxations
> are cloud specific.
>
>> (B) The Application Specifies the Multitenancy Storage Mapping Strategy
>>
>> An alternative approach is that the application specifies the required
>> (or preferred) multitenancy storage mapping strategy in the
>> persistence.xml.
>
> This is a preferable approach, and even though it may not be *necessary* for
> cloud deployment, it would be nice to have these options so the provider can do
> some checking at deployment time rather than the app failing at runtime.
> It would also provide a standard way of configuring for striping in SaaS apps.
>
>> For example, a multitenant application that is designed with the
>> intention that separate databases be used might indicate this in the
>> persistence.xml as multitenancy = SEPARATE_DATABASE.
>
> In general I don't think they would even need to specify this, since this is what
> we already assume, isn't it?
>

But in a cloud environment that may not be the default storage option, so
this would indicate that the application was written to *require* use of that
strategy.

>> An application that is designed with the intention that databases may
>> be shared by partitioning at the database schema level might indicate
>> this in the persistence.xml as multitenancy = SHARED_DATABASE. [A
>> portable application that specifies this strategy should not specify
>> schema or catalog names, as it might otherwise fail to deploy or fail
>> to initialize.]
>
> This probably doesn't matter, but although I find the terminology easy to
> understand, from a PaaS user perspective the line between 1 and 2 might
> be a little fuzzy because most of the cloud providers have some kind of
> "database service", but the capabilities of those services differ.
> In some cases one can create db instances and schemas (SEPARATE DB), yet
> in other cases the tenant "database" is just a place to store data, with a
> default schema and no ability to create a new one (SHARED DB).
>
>> An application that is designed with the intention that tables be
>> shared might indicate this in the persistence.xml as multitenancy =
>> SHARED_SCHEMA. An app that uses explicit multitenant mapping metadata
>> would be expected to specify this.
>>
>> [Open Issue: Is it useful to specify requirements along the lines of
>> those used in approach (A) with this approach? If so, is the platform
>> provider allowed to choose a different mapping strategy as long as
>> that approach is more isolated? If no functional requirements are
>> specified as in approach (A) and if a mapping strategy is specified in
>> the persistence.xml that is provided by the application submitter,
>> then if this information is not observed, the risk is that the app
>> will fail. For example, observation of the specified mapping strategy
>> might be required for the case where explicit multitenant mapping
>> metadata is supplied for the striped mapping approach.]
>>
>>
>> With both the approaches (A) and (B), different storage mapping
>> strategies may be used for different tenants of the same application
>> if the cloud platform provider supports a range of storage mapping
>> choices.
>>
>>
>> REQUIREMENTS FOR PORTABLE APPLICATIONS
>>
>> Applications that are intended to be portable in cloud environments
>> should not specify schema or catalog names.
>
> This sounds very reasonable to me and solves 99% of the cloud JPA app scenario.
>
>> DEPLOYMENT
>>
>> When an application instance is deployed for a tenant, the container
>> needs to make the tenant identifier and tenant-related configuration
>> information available to the persistence provider. The container
>> needs to pass to the persistence provider a data source that is
>> configured with appropriate credentials for the tenant, and which will
>> provide isolation between that tenant and other tenants of the
>> application. We should probably also define an interface to capture
>> the tenant identifier and tenant-related metadata and configuration
>> information that the container needs to pass to the persistence
>> provider, e.g., a TenantContext.
>
> Again, this would definitely help to enable JPA in SaaS apps.
>
>> OTHER OPEN ISSUES
>>
>> 1. Additional metadata to support schema generation.
>
> We might want to rename this to what it actually does -- table generation :-)
>

That would certainly be more precise for the separate schema case :-)
But if the application instance "owned" the database, the persistence provider might
be creating schemas.

>> 2. Do we need metadata to indicate whether an application supports
>> multitenant use -- i.e., whether it is "multitenant enabled"?
>> Do we need this information specifically for JPA?
>
> Again, it is not strictly required for PaaS, but it would be really nice to have it
> so SaaS could be enabled, even though it is not formally supported.
>
>> 3. Specification of resources that are shared across tenants--e.g.,
>> a persistence unit for reference data that can be accessed by
>> multiple tenants.
>
> I'm not sure we need to solve this problem at this stage. Multiple tenants
> accessing a shared read-only resource through identical JPA configurations is
> one thing, but having a single shared persistence unit spanning multiple
> applications seems out of scope.

I was assuming the former -- i.e., the tenants had identical persistence unit configurations
for the resource, but no sharing at the EMF level -- i.e., separate EMF per application
instance. Things certainly do get more interesting when we move into the SaaS case where
application instances are multitenant :-)